Project Objective, Overview & Research
- Environmental Health and Ecosystem Balance Assessment: By monitoring the populations of bats and butterflies, we can gain insights into the overall health of the ecosystem in Melbourne's parks. These species are often considered indicator species[1], meaning changes in their populations can signal changes in the broader environmental conditions. For instance, bats play a crucial role in controlling insect populations and pollinating plants[2], while butterflies are indicators of ecological diversity and health. Studying these populations helps in understanding the impacts of urban development on natural habitats and the effectiveness of current conservation strategies.
- Informed Conservation and Urban Planning: The data gathered will be crucial in guiding conservation efforts and urban planning decisions. Understanding where and how these species thrive can inform the development of green spaces that support biodiversity. This can lead to the creation of urban environments that are not only beneficial for wildlife but also enhance the quality of life for city residents. For example, identifying key habitats and migration patterns of these species can aid in designing parks and green corridors that promote their conservation.
- Public Engagement and Education: This research can also play a significant role in public education and engagement. By raising awareness of the importance of biodiversity in urban areas, we can foster a greater appreciation and understanding among the public. This can lead to increased community involvement in conservation efforts and sustainable practices[5]. Moreover, it can help promote citizen science initiatives, where locals contribute to data collection and monitoring, further expanding the scope and impact of our research.
Part 2 (Next 3 Weeks): This phase dives into detailed trend analysis and mapping. It aims to uncover patterns in the distribution of these species, hypothesize the reasons for their specific locations, and understand the impact of urban environments on them. The project concludes with strategic recommendations for biodiversity conservation and urban planning.
References
Part 1 (Set up & Pre-processing)¶
- Set Up
- Pre-processing
Part 1.1: Set Up¶
- Import Core Libraries
- Import Dependencies
##### Install packages
!pip install osmnx
!pip install tqdm
#Import core libraries
import requests
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
import json
import ipywidgets as widgets
from ipywidgets import interact
import osmnx as ox
import geopandas as gpd
import networkx as nx
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")
# Define the company colors format for matplotlib
dark_theme_colors = ['#08af64', '#14a38e', '#0f9295', '#056b8a', '#121212'] #Dark theme
light_theme_colors = ['#2af598', '#22e4ac', '#1bd7bb', '#14c9cb', '#0fbed8', '#08b3e5'] #Light theme
def fetch_data(base_url, dataset, api_key, num_records=99, offset=0):
    all_records = []
    max_offset = 9900  # The API caps limit + offset at 10,000 records
    while True:
        if offset > max_offset:
            break
        filters = f'{dataset}/records?limit={num_records}&offset={offset}'
        url = f'{base_url}{filters}&api_key={api_key}'
        try:
            result = requests.get(url, timeout=10)
            result.raise_for_status()
            records = result.json().get('results')
        except requests.exceptions.RequestException as e:
            raise Exception(f'API request failed: {e}')
        if records is None:
            break
        all_records.extend(records)
        if len(records) < num_records:
            break  # A short page means there is no more data
        offset += num_records
    df = pd.DataFrame(all_records)
    return df
BASE_URL = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
API_KEY = ''
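The pagination loop in `fetch_data` can be exercised offline by factoring the "fetch one page" step into a callable. A minimal sketch under that assumption (`paginate`, `fake_page` and the in-memory `DATA` are illustrative, not part of the project code):

```python
import pandas as pd

def paginate(fetch_page, page_size=99, max_offset=9900):
    """Collect pages until an empty/short page or the API's offset cap."""
    all_records, offset = [], 0
    while offset <= max_offset:
        records = fetch_page(offset, page_size)
        if not records:
            break
        all_records.extend(records)
        if len(records) < page_size:
            break  # A short page means there is no more data
        offset += page_size
    return pd.DataFrame(all_records)

# Hypothetical in-memory "API" with 250 records, for illustration only
DATA = [{'id': i} for i in range(250)]

def fake_page(offset, limit):
    return DATA[offset:offset + limit]

df = paginate(fake_page)
print(len(df))  # 250
```

Structuring the fetch this way makes the real `fetch_data` easy to unit-test by swapping the network call for a stub.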
Part 1.2: Pre-Processing¶
- Fetch each dataset, CoM API or load CSV
- Load data to dataframe
- Data cleaning (duplicates, missing values, data types, etc)
- Save cleaned dataset
- Encode Data
- Save encoded dataset
- Correlations
- Merge datasets
Dataset 1: Bat records in Fitzroy Gardens and Royal Botanic Gardens 2010¶
- Dataset Identifier: bat-records-in-fitzroy-gardens-and-royal-botanic-gardens-2010
Summary: This dataset provides detailed observations of various bat species within Fitzroy Gardens and Royal Botanic Gardens over the year 2010. It includes taxonomic classification, common names, and exact locations within the parks.
This dataset includes observations of bats categorized by taxa, genus, and species with specific geo-spatial information, indicating precise observation spots within the parks. The data also lists common names alongside the scientific taxonomy to assist in species identification.
Note: Each observation is pinpointed with latitude and longitude coordinates, providing exact locations but not the area affected or the size of the bat populations.
Note: This dataset is crucial for ecological monitoring and conservation efforts, helping track bat populations and distribution within urban parklands.
SENSOR_DATASET = 'bat-records-in-fitzroy-gardens-and-royal-botanic-gardens-2010'
bat = fetch_data(BASE_URL, SENSOR_DATASET, API_KEY)
bat.head()
| taxa | kingdom | phylum | class | order | family | genus | species | common_name | park_name | location | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | CHIROPTERA | MOLOSSIDAE | Mormopterus | None | None | Royal Botanic Gardens | {'lon': 144.9804, 'lat': -37.8312} |
| 1 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | CHIROPTERA | VESPERTILIONIDAE | Chalinolobus | gouldii | Gould's Wattled Bat | Fitzroy Gardens | {'lon': 144.9786, 'lat': -37.8135} |
| 2 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | CHIROPTERA | VESPERTILIONIDAE | Chalinolobus | gouldii | Gould's Wattled Bat | Royal Botanic Gardens | {'lon': 144.9804, 'lat': -37.8312} |
| 3 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | CHIROPTERA | MOLOSSIDAE | Austronomous | australis | White-striped Freetail Bat | Fitzroy Gardens | {'lon': 144.9786, 'lat': -37.8135} |
| 4 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | CHIROPTERA | MOLOSSIDAE | Mormopterus | None | None | Fitzroy Gardens | {'lon': 144.9786, 'lat': -37.8135} |
#View info of data
bat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   taxa         10 non-null     object
 1   kingdom      10 non-null     object
 2   phylum       10 non-null     object
 3   class        10 non-null     object
 4   order        10 non-null     object
 5   family       10 non-null     object
 6   genus        10 non-null     object
 7   species      5 non-null      object
 8   common_name  5 non-null      object
 9   park_name    10 non-null     object
 10  location     10 non-null     object
dtypes: object(11)
memory usage: 1012.0+ bytes
# Check missing values for dataset
missing_values = bat.isnull().sum()
missing_values # Number of missing values in each column
taxa           0
kingdom        0
phylum         0
class          0
order          0
family         0
genus          0
species        5
common_name    5
park_name      0
location       0
dtype: int64
# Column names to check for missing values
column_name_species = 'species'
column_name_common_name = 'common_name'
# Calculate the percentage of missing values
percentage_missing_species = (missing_values[column_name_species] / len(bat)) * 100
percentage_missing_common_name = (missing_values[column_name_common_name] / len(bat)) * 100
# Print the results
print(f"Percentage of missing values for '{column_name_species}': {percentage_missing_species:.2f}%")
print(f"Percentage of missing values for '{column_name_common_name}': {percentage_missing_common_name:.2f}%")
Percentage of missing values for 'species': 50.00% Percentage of missing values for 'common_name': 50.00%
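The per-column percentage computed above generalises to a single vectorised expression over the whole frame. A small sketch on a toy frame (the example data is illustrative, mirroring the bat dataset's half-missing species columns):

```python
import pandas as pd

# Toy frame: 'species' half missing, 'park_name' complete
toy = pd.DataFrame({
    'species': ['gouldii', None, 'australis', None],
    'park_name': ['Fitzroy Gardens'] * 4,
})

# isnull().mean() gives the fraction of nulls per column in one pass
missing_pct = toy.isnull().mean() * 100
print(missing_pct.to_dict())  # {'species': 50.0, 'park_name': 0.0}
```

This avoids naming each column by hand when a dataset has many of them.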
Cleaning bat dataset¶
# Replacing 'Unknown' placeholders with NaN in the species columns
# (column names are lowercase in this dataset)
if 'species' in bat.columns:
    bat['species'] = bat['species'].replace('Unknown', np.nan)
if 'common_name' in bat.columns:
    bat['common_name'] = bat['common_name'].replace('Unknown', np.nan)
# Displaying the cleaned data information and the first few rows
print(bat.info())
print(bat.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 taxa 10 non-null object
1 kingdom 10 non-null object
2 phylum 10 non-null object
3 class 10 non-null object
4 order 10 non-null object
5 family 10 non-null object
6 genus 10 non-null object
7 species 5 non-null object
8 common_name 5 non-null object
9 park_name 10 non-null object
10 location 10 non-null object
dtypes: object(11)
memory usage: 1012.0+ bytes
None
taxa kingdom phylum class order family \
0 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE
1 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA VESPERTILIONIDAE
2 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA VESPERTILIONIDAE
3 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE
4 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE
genus species common_name park_name \
0 Mormopterus None None Royal Botanic Gardens
1 Chalinolobus gouldii Gould's Wattled Bat Fitzroy Gardens
2 Chalinolobus gouldii Gould's Wattled Bat Royal Botanic Gardens
3 Austronomous australis White-striped Freetail Bat Fitzroy Gardens
4 Mormopterus None None Fitzroy Gardens
location
0 {'lon': 144.9804, 'lat': -37.8312}
1 {'lon': 144.9786, 'lat': -37.8135}
2 {'lon': 144.9804, 'lat': -37.8312}
3 {'lon': 144.9786, 'lat': -37.8135}
4 {'lon': 144.9786, 'lat': -37.8135}
# Print the column names to verify the correct column exists
print(bat.columns)
Index(['taxa', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus',
'species', 'common_name', 'park_name', 'location'],
dtype='object')
Matching column names for data merge¶
# Renaming the 'location' column to 'geopoint'
bat.rename(columns={'location': 'geopoint'}, inplace=True)
# Displaying the cleaned data information and the first few rows
print(bat.info())
print(bat.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 taxa 10 non-null object
1 kingdom 10 non-null object
2 phylum 10 non-null object
3 class 10 non-null object
4 order 10 non-null object
5 family 10 non-null object
6 genus 10 non-null object
7 species 5 non-null object
8 common_name 5 non-null object
9 park_name 10 non-null object
10 geopoint 10 non-null object
dtypes: object(11)
memory usage: 1012.0+ bytes
None
taxa kingdom phylum class order family \
0 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE
1 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA VESPERTILIONIDAE
2 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA VESPERTILIONIDAE
3 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE
4 Mammal ANIMALIA CHORDATA MAMMALIA CHIROPTERA MOLOSSIDAE
genus species common_name park_name \
0 Mormopterus None None Royal Botanic Gardens
1 Chalinolobus gouldii Gould's Wattled Bat Fitzroy Gardens
2 Chalinolobus gouldii Gould's Wattled Bat Royal Botanic Gardens
3 Austronomous australis White-striped Freetail Bat Fitzroy Gardens
4 Mormopterus None None Fitzroy Gardens
geopoint
0 {'lon': 144.9804, 'lat': -37.8312}
1 {'lon': 144.9786, 'lat': -37.8135}
2 {'lon': 144.9804, 'lat': -37.8312}
3 {'lon': 144.9786, 'lat': -37.8135}
4 {'lon': 144.9786, 'lat': -37.8135}
Dataset 2: Butterfly biodiversity survey 2017¶
- Dataset Identifier: butterfly-biodiversity-survey-2017
Summary: Comprehensive survey data capturing observations of butterflies across various sites within Melbourne for the year 2017. Details include environmental conditions, vegetation types, and specific butterfly sightings.
The dataset provides granular details such as temperature, humidity, vegetation details, and wind conditions at the time of each observation, along with precise geographic coordinates. It aims to aid research into butterfly populations and their responses to urban environments.
Note: Data points include specific environmental conditions and butterfly species observations to better understand the impact of urban settings on biodiversity.
Note: The dataset is useful for ecological research and conservation planning, providing essential data for studies on biodiversity in urban parks.
SENSOR_DATASET = 'butterfly-biodiversity-survey-2017'
butterfly = fetch_data(BASE_URL, SENSOR_DATASET, API_KEY)
butterfly.head()
| site | sloc | walk | date | time | vegwalktime | vegspecies | vegfamily | lat | lon | ... | tabe | brow | csem | aand | jvil | paur | ogyr | gmac | datetime | location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Womens Peace Gardens | 2 | 1 | 2017-02-26 | 0001-01-01T11:42:00+00:00 | 1.3128 | Schinus molle | Anacardiaceae | -37.7912 | 144.9244 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-02-26T11:42:00+00:00 | {'lon': 144.9244, 'lat': -37.7912} |
| 1 | Argyle Square | 1 | 1 | 2017-11-02 | 0001-01-01T10:30:00+00:00 | 0.3051 | Rosmarinus officinalis | Lamiaceae | -37.8023 | 144.9665 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-02-11T10:30:00+00:00 | {'lon': 144.9665, 'lat': -37.8023} |
| 2 | Argyle Square | 2 | 1 | 2017-12-01 | 0001-01-01T10:35:00+00:00 | 0.3620 | Euphorbia sp. | Euphorbiaceae | -37.8026 | 144.9665 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-01-12T10:35:00+00:00 | {'lon': 144.9665, 'lat': -37.8026} |
| 3 | Westgate Park | 4 | 1 | 2017-03-03 | 0001-01-01T11:44:00+00:00 | 3.1585 | Melaleuca lanceolata | Myrtaceae | -37.8316 | 144.9089 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-03-03T11:44:00+00:00 | {'lon': 144.9089, 'lat': -37.8316} |
| 4 | Argyle Square | 1 | 3 | 2017-01-15 | 0001-01-01T12:33:00+00:00 | 0.4432 | Mentha sp. | Lamiaceae | -37.8027 | 144.9662 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-01-15T12:33:00+00:00 | {'lon': 144.9662, 'lat': -37.8027} |
5 rows × 42 columns
#View info of data
butterfly.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4056 entries, 0 to 4055
Data columns (total 42 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   site         4056 non-null   object
 1   sloc         4056 non-null   int64
 2   walk         4056 non-null   int64
 3   date         4056 non-null   object
 4   time         4056 non-null   object
 5   vegwalktime  4052 non-null   float64
 6   vegspecies   4056 non-null   object
 7   vegfamily    4056 non-null   object
 8   lat          4056 non-null   float64
 9   lon          4056 non-null   float64
 10  temp         4056 non-null   float64
 11  hum          4056 non-null   float64
 12  win1         4056 non-null   float64
 13  win2         4056 non-null   float64
 14  win3         4056 non-null   float64
 15  win4         4056 non-null   float64
 16  win          4056 non-null   float64
 17  per          4056 non-null   int64
 18  sur          4056 non-null   int64
 19  prap         4056 non-null   int64
 20  vker         4056 non-null   int64
 21  vite         4056 non-null   int64
 22  blue         4056 non-null   int64
 23  dpet         4056 non-null   int64
 24  dple         4056 non-null   int64
 25  pana         4056 non-null   int64
 26  pdem         4056 non-null   int64
 27  hesp         4056 non-null   int64
 28  esmi         4056 non-null   int64
 29  cato         4056 non-null   int64
 30  gaca         4056 non-null   int64
 31  belo         4056 non-null   int64
 32  tabe         4056 non-null   int64
 33  brow         4056 non-null   int64
 34  csem         4056 non-null   int64
 35  aand         4056 non-null   int64
 36  jvil         4056 non-null   int64
 37  paur         4056 non-null   int64
 38  ogyr         4056 non-null   int64
 39  gmac         4056 non-null   int64
 40  datetime     4056 non-null   object
 41  location     4056 non-null   object
dtypes: float64(10), int64(25), object(7)
memory usage: 1.3+ MB
# Check missing values for dataset
missing_values = butterfly.isnull().sum()
missing_values # Number of missing values in each column
site           0
sloc           0
walk           0
date           0
time           0
vegwalktime    4
vegspecies     0
vegfamily      0
lat            0
lon            0
temp           0
hum            0
win1           0
win2           0
win3           0
win4           0
win            0
per            0
sur            0
prap           0
vker           0
vite           0
blue           0
dpet           0
dple           0
pana           0
pdem           0
hesp           0
esmi           0
cato           0
gaca           0
belo           0
tabe           0
brow           0
csem           0
aand           0
jvil           0
paur           0
ogyr           0
gmac           0
datetime       0
location       0
dtype: int64
Matching columns for merging data¶
# Renaming columns for consistency
butterfly = butterfly.rename(columns={
    'date': 'sighting_date',
    'lat': 'latitude',
    'lon': 'longitude',
    'location': 'geopoint'
})
# Converting 'datetime' column to datetime type and extracting the time part
butterfly['datetime'] = pd.to_datetime(butterfly['datetime'])
butterfly['time'] = butterfly['datetime'].dt.time
# Rebuilding 'geopoint' as a "lat, lon" string
butterfly['geopoint'] = butterfly.apply(
    lambda row: f"{row['latitude']}, {row['longitude']}", axis=1
)
# Display the first few rows of the cleaned dataset
butterfly.head()
| site | sloc | walk | sighting_date | time | vegwalktime | vegspecies | vegfamily | latitude | longitude | ... | tabe | brow | csem | aand | jvil | paur | ogyr | gmac | datetime | geopoint | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Womens Peace Gardens | 2 | 1 | 2017-02-26 | 11:42:00 | 1.3128 | Schinus molle | Anacardiaceae | -37.7912 | 144.9244 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-02-26 11:42:00+00:00 | -37.7912, 144.9244 |
| 1 | Argyle Square | 1 | 1 | 2017-11-02 | 10:30:00 | 0.3051 | Rosmarinus officinalis | Lamiaceae | -37.8023 | 144.9665 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-02-11 10:30:00+00:00 | -37.8023, 144.9665 |
| 2 | Argyle Square | 2 | 1 | 2017-12-01 | 10:35:00 | 0.3620 | Euphorbia sp. | Euphorbiaceae | -37.8026 | 144.9665 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-01-12 10:35:00+00:00 | -37.8026, 144.9665 |
| 3 | Westgate Park | 4 | 1 | 2017-03-03 | 11:44:00 | 3.1585 | Melaleuca lanceolata | Myrtaceae | -37.8316 | 144.9089 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-03-03 11:44:00+00:00 | -37.8316, 144.9089 |
| 4 | Argyle Square | 1 | 3 | 2017-01-15 | 12:33:00 | 0.4432 | Mentha sp. | Lamiaceae | -37.8027 | 144.9662 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-01-15 12:33:00+00:00 | -37.8027, 144.9662 |
5 rows × 42 columns
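The row-wise `apply` used above to rebuild `geopoint` also has a vectorised equivalent, which is considerably faster on larger frames. A sketch under the assumption that the latitude/longitude columns are numeric (the toy data mirrors the butterfly coordinates):

```python
import pandas as pd

df = pd.DataFrame({'latitude': [-37.7912, -37.8023],
                   'longitude': [144.9244, 144.9665]})

# String concatenation on whole columns avoids a Python-level call per row
df['geopoint'] = df['latitude'].astype(str) + ', ' + df['longitude'].astype(str)
print(df['geopoint'].tolist())  # ['-37.7912, 144.9244', '-37.8023, 144.9665']
```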
Dataset 3: Bioblitz 2016¶
- Dataset Identifier: bioblitz-2016
Summary: This dataset records a variety of living organisms spotted during the BioBlitz event in Melbourne in 2016. It catalogues diverse species ranging from molluscs to annelids and includes detailed taxonomic information.
Data entries are detailed with the taxonomy from kingdom to species level where available, common names, and precise geo-coordinates of each sighting. Identification notes and resource names provide context about the sighting sources and identification methods.
Note: Observations were gathered through community and expert contributions during the BioBlitz event, aimed at cataloguing as many species as possible within a short time frame.
Note: This dataset serves as a valuable resource for ecological studies and environmental education, offering insights into the local biodiversity of Melbourne.
SENSOR_DATASET = 'bioblitz-2016'
bio = fetch_data(BASE_URL, SENSOR_DATASET, API_KEY)
bio.head()
| taxa | kingdom | phylum | class | order | family | genus | species | common_name | identification_notes | data_resource_name | sighting_date | latitude | longitude | location | geopoint | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mollusc | ANIMALIA | MOLLUSCA | None | None | None | None | None | None | None | Participate Melbourne | 2016-03-02 | -37.8298 | 144.9002 | None | {'lon': 144.9002, 'lat': -37.8298} |
| 1 | Insect | ANIMALIA | ARTHROPODA | INSECTA | None | None | None | None | None | Insect | Bowerbird | 2016-03-20 | -37.8185 | 144.9748 | None | {'lon': 144.9748, 'lat': -37.8185} |
| 2 | Annelid | ANIMALIA | ANNELIDA | OLIGOCHAETA | None | None | None | None | None | Earthworm | Handwritten | 2016-03-04 | -37.8060 | 144.9710 | None | {'lon': 144.971, 'lat': -37.806} |
| 3 | Annelid | ANIMALIA | ANNELIDA | OLIGOCHAETA | None | None | None | None | None | Freshwater Oligochaete Worm | Handwritten | 2016-03-04 | -37.8290 | 144.9800 | None | {'lon': 144.98, 'lat': -37.829} |
| 4 | Amphibian | ANIMALIA | CHORDATA | AMPHIBIA | ANURA | HYLIDAE | Litoria | None | None | Tiny Frog | Participate Melbourne | 2016-03-16 | -37.8204 | 145.2496 | None | {'lon': 145.2496, 'lat': -37.8204} |
#View info of data
bio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1356 entries, 0 to 1355
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   taxa                  1353 non-null   object
 1   kingdom               1353 non-null   object
 2   phylum                1322 non-null   object
 3   class                 1314 non-null   object
 4   order                 1177 non-null   object
 5   family                1139 non-null   object
 6   genus                 953 non-null    object
 7   species               846 non-null    object
 8   common_name           802 non-null    object
 9   identification_notes  588 non-null    object
 10  data_resource_name    1356 non-null   object
 11  sighting_date         1350 non-null   object
 12  latitude              1356 non-null   float64
 13  longitude             1353 non-null   float64
 14  location              0 non-null      object
 15  geopoint              1353 non-null   object
dtypes: float64(2), object(14)
memory usage: 169.6+ KB
# Check missing values for dataset
missing_values = bio.isnull().sum()
missing_values # Number of missing values in each column
taxa                       3
kingdom                    3
phylum                    34
class                     42
order                    179
family                   217
genus                    403
species                  510
common_name              554
identification_notes     768
data_resource_name         0
sighting_date              6
latitude                   0
longitude                  3
location                1356
geopoint                   3
dtype: int64
Cleaning the Bioblitz dataset¶
# Filling missing values in categorical (object) columns with 'Unknown'
# (column names are lowercase in this dataset; the empty 'location' column
# is retained here and dropped after the merge)
categorical_columns = bio.select_dtypes(include=['object']).columns
bio[categorical_columns] = bio[categorical_columns].fillna('Unknown')
# Displaying the cleaned data information and the first few rows
print(bio.info())
print(bio.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1356 entries, 0 to 1355
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 taxa 1356 non-null object
1 kingdom 1356 non-null object
2 phylum 1356 non-null object
3 class 1356 non-null object
4 order 1356 non-null object
5 family 1356 non-null object
6 genus 1356 non-null object
7 species 1356 non-null object
8 common_name 1356 non-null object
9 identification_notes 1356 non-null object
10 data_resource_name 1356 non-null object
11 sighting_date 1356 non-null object
12 latitude 1356 non-null float64
13 longitude 1353 non-null float64
14 location 1356 non-null object
15 geopoint 1356 non-null object
dtypes: float64(2), object(14)
memory usage: 169.6+ KB
None
taxa kingdom phylum class order family genus \
0 Mollusc ANIMALIA MOLLUSCA Unknown Unknown Unknown Unknown
1 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown
2 Annelid ANIMALIA ANNELIDA OLIGOCHAETA Unknown Unknown Unknown
3 Annelid ANIMALIA ANNELIDA OLIGOCHAETA Unknown Unknown Unknown
4 Amphibian ANIMALIA CHORDATA AMPHIBIA ANURA HYLIDAE Litoria
species common_name identification_notes data_resource_name \
0 Unknown Unknown Unknown Participate Melbourne
1 Unknown Unknown Insect Bowerbird
2 Unknown Unknown Earthworm Handwritten
3 Unknown Unknown Freshwater Oligochaete Worm Handwritten
4 Unknown Unknown Tiny Frog Participate Melbourne
sighting_date latitude longitude location \
0 2016-03-02 -37.8298 144.9002 Unknown
1 2016-03-20 -37.8185 144.9748 Unknown
2 2016-03-04 -37.8060 144.9710 Unknown
3 2016-03-04 -37.8290 144.9800 Unknown
4 2016-03-16 -37.8204 145.2496 Unknown
geopoint
0 {'lon': 144.9002, 'lat': -37.8298}
1 {'lon': 144.9748, 'lat': -37.8185}
2 {'lon': 144.971, 'lat': -37.806}
3 {'lon': 144.98, 'lat': -37.829}
4 {'lon': 145.2496, 'lat': -37.8204}
Merge Bat and Bioblitz dataset¶
# Merge the two datasets on shared columns
merged = pd.merge(bio, bat, on=['taxa', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'common_name'],
how='outer')
# Dropping specified columns
columns_to_drop = ['identification_notes', 'data_resource_name', 'sighting_date',
'latitude', 'longitude', 'location', 'park_name']
merged_cleaned = merged.drop(columns=columns_to_drop)
# Combining geopoint columns
merged_cleaned['geopoint'] = merged_cleaned['geopoint_x'].combine_first(merged_cleaned['geopoint_y'])
merged_cleaned = merged_cleaned.drop(columns=['geopoint_x', 'geopoint_y'])
# Save the cleaned merged dataset (a relative path keeps the notebook portable)
merged_cleaned.to_csv('merged.csv', index=False)
merged_cleaned.head()
| taxa | kingdom | phylum | class | order | family | genus | species | common_name | geopoint | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mollusc | ANIMALIA | MOLLUSCA | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | {'lon': 144.9002, 'lat': -37.8298} |
| 1 | Mollusc | ANIMALIA | MOLLUSCA | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | {'lon': 133.7751, 'lat': -25.2744} |
| 2 | Mollusc | ANIMALIA | MOLLUSCA | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | {'lon': 144.9376, 'lat': -37.7729} |
| 3 | Insect | ANIMALIA | ARTHROPODA | INSECTA | Unknown | Unknown | Unknown | Unknown | Unknown | {'lon': 144.9748, 'lat': -37.8185} |
| 4 | Insect | ANIMALIA | ARTHROPODA | INSECTA | Unknown | Unknown | Unknown | Unknown | Unknown | {'lon': 144.9634, 'lat': -37.7908} |
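When outer-merging two sources as above, pandas' `indicator=True` flag is a useful audit of where each row came from. A minimal sketch on toy frames (the single-column key and example values are illustrative, mirroring the bio/bat merge):

```python
import pandas as pd

# Toy frames sharing a 'genus' key, for illustration only
bio_toy = pd.DataFrame({'genus': ['Litoria', 'Chalinolobus'], 'site': ['A', 'B']})
bat_toy = pd.DataFrame({'genus': ['Chalinolobus', 'Mormopterus'], 'park': ['C', 'D']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
audit = pd.merge(bio_toy, bat_toy, on='genus', how='outer', indicator=True)
print(audit[['genus', '_merge']])
```

Checking the `_merge` counts before dropping columns makes it easy to spot keys that matched nothing on the other side.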
Part 2 (Analysis)¶
Removing unknown classes and show unique values¶
# Assuming 'Unknown' or NaN are the placeholders for unknown values in the 'class' column
merged_cleaned = merged_cleaned[merged_cleaned['class'].notna()]
merged_cleaned = merged_cleaned[merged_cleaned['class'] != 'Unknown']
# You can check the effect by looking at the unique values in the 'class' column again
print(merged_cleaned['class'].unique())
['INSECTA' 'OLIGOCHAETA' 'AMPHIBIA' 'ARACHNIDA' 'EQUISETOPSIDA' 'POLYCHAETA' 'CLITELLATA' 'NEMERTINEA' 'CHONDRICHTHYES' 'AVES' 'AGARICOMYCETES' 'ANTHOZOA' 'MALACOSTRACA' 'MAXILLOPODA' 'DIPLOPOD' 'ASTEROIDEA' 'ACTINOPTERYGII' 'LECANOROMYCETES' 'BIVALVIA' 'MAMMALIA' 'GASTROPODA' 'BRYOPSIDOPHYCEAE' 'FLORIDEOPHYCEAE' 'GINKGOOPSIDA' 'REPTILIA' 'ASCIDIACEA' 'SCYPHOZOA' 'OSTROCODA' 'ECHINOIDEA' 'LILIOPSIDA' 'CHILOPODA' 'ECHIURIDEA' 'ULVOPHYCEAE']
Bar chart of observation counts per class¶
# Count observations per taxonomic class and plot as a bar chart
# (matplotlib is already imported above)
class_counts = merged_cleaned['class'].value_counts()
plt.figure(figsize=(10, 5))
class_counts.plot(kind='bar', color='skyblue')
plt.title('Observation Counts by Class')
plt.xlabel('Class')
plt.ylabel('Counts')
plt.xticks(rotation=90)
plt.show()
Pie Chart of all the Phylum¶
# Calculate the value counts for the 'phylum' column
phylum_counts = merged_cleaned['phylum'].value_counts()
# Define a threshold for 'Other' category. You can adjust the threshold as needed.
threshold_percent = 5 # Percentage considered as 'Other'
other_threshold = sum(phylum_counts) * (threshold_percent / 100)
# Combine smaller categories into 'Other'
other = phylum_counts[phylum_counts < other_threshold].sum()
main_phylum_counts = phylum_counts[phylum_counts >= other_threshold]
main_phylum_counts['Other'] = other
# Create a pie chart
plt.figure(figsize=(8, 8))
plt.pie(main_phylum_counts, labels=main_phylum_counts.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Pie Chart of Phylum with "Other" Category')
plt.show()
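The "Other"-bucketing logic above is reusable across charts. A small sketch of a helper that folds categories below a percentage threshold into one slice (the function name `collapse_small` and the example counts are my own, for illustration):

```python
import pandas as pd

def collapse_small(counts, threshold_percent=5):
    """Fold categories below threshold_percent of the total into 'Other'."""
    cutoff = counts.sum() * threshold_percent / 100
    main = counts[counts >= cutoff].copy()
    other = counts[counts < cutoff].sum()
    if other > 0:
        main['Other'] = other
    return main

# Illustrative phylum-like counts
counts = pd.Series({'CHORDATA': 60, 'ARTHROPODA': 30, 'MOLLUSCA': 6,
                    'ANNELIDA': 2, 'CNIDARIA': 2})
bucketed = collapse_small(counts)
print(bucketed.to_dict())  # {'CHORDATA': 60, 'ARTHROPODA': 30, 'MOLLUSCA': 6, 'Other': 4}
```

The returned Series can be passed straight to `plt.pie` with `labels=bucketed.index`.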
Splitting geopoint column into lat and long¶
# Function to safely extract latitude and longitude from a geopoint value
import ast

def extract_lat_lon(geopoint):
    if isinstance(geopoint, str):
        try:
            # ast.literal_eval safely parses the dict-like string (unlike eval)
            geopoint = ast.literal_eval(geopoint)
        except (ValueError, SyntaxError):
            return None, None  # Parsing failed
    # Check that the parsed value is a dict with 'lat' and 'lon' keys
    if isinstance(geopoint, dict) and 'lon' in geopoint and 'lat' in geopoint:
        return geopoint['lat'], geopoint['lon']
    return None, None  # Keys not present
# Apply the function to the 'geopoint' column
merged_cleaned[['latitude', 'longitude']] = merged_cleaned['geopoint'].apply(extract_lat_lon).apply(pd.Series)
# Show the head of the DataFrame to confirm the new columns
print(merged_cleaned.head())
taxa kingdom phylum class order family genus species \
3 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown
4 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown
5 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown
6 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown
7 Insect ANIMALIA ARTHROPODA INSECTA Unknown Unknown Unknown Unknown
common_name geopoint latitude longitude
3 Unknown {'lon': 144.9748, 'lat': -37.8185} -37.8185 144.9748
4 Unknown {'lon': 144.9634, 'lat': -37.7908} -37.7908 144.9634
5 Unknown {'lon': 144.9706, 'lat': -37.8216} -37.8216 144.9706
6 Unknown {'lon': 144.9606, 'lat': -37.7986} -37.7986 144.9606
7 Unknown {'lon': 144.9564, 'lat': -37.7918} -37.7918 144.9564
# Drop the original 'geopoint' column
merged_cleaned.drop(columns=['geopoint'], inplace=True)
Taxa Distribution¶
# Plot a histogram for the distribution of different taxa
plt.figure(figsize=(10, 6))
merged_cleaned['taxa'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Distribution of Taxa in Melbourne\'s Parks and Green Spaces')
plt.xlabel('Taxa')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.grid(True)
plt.show()
Inspect dataset for taxa¶
import folium
# Assuming you've already cleaned and prepared 'merged_cleaned' DataFrame
melbourne_coordinates = [-37.814, 144.96332]
# Create a Folium map centered around Melbourne
m = folium.Map(location=melbourne_coordinates, zoom_start=12)
# Print unique taxa
unique_taxa = merged_cleaned['taxa'].unique()
print("Unique Taxa in the Dataset:")
print(unique_taxa)
# Define a color mapping for different taxa
taxa_mapping = {
    'Mollusc': {'color': 'green', 'icon': 'glyphicon-leaf'},
    'Insect': {'color': 'red', 'icon': 'glyphicon-bug'},
    'Bird': {'color': 'blue', 'icon': 'glyphicon-bird'},
    'Mammal': {'color': 'gray', 'icon': 'glyphicon-knight'},
    # Add other taxa and customize icons as needed
}
# Loop through each row in the DataFrame to add markers, ensuring no NaN coordinates
for _, row in merged_cleaned.dropna(subset=['latitude', 'longitude']).iterrows():
    lat, lng = row['latitude'], row['longitude']
    taxa = row['taxa']
    if taxa in taxa_mapping:
        marker_color = taxa_mapping[taxa]['color']
        marker_icon = taxa_mapping[taxa]['icon']
    else:
        marker_color = 'purple'  # default color
        marker_icon = 'glyphicon-question-sign'  # default icon
    # Add a marker to the map
    folium.Marker(
        location=[lat, lng],
        popup=f"{taxa} - {row['common_name']}",
        icon=folium.Icon(color=marker_color, icon=marker_icon)
    ).add_to(m)
# Display the map
m
Unique Taxa in the Dataset: ['Insect' 'Annelid' 'Amphibian' 'Arachnid' 'Plant' 'Stingray' 'Bird' 'Fungi' 'Cnidaria' 'Crustacean' 'Diplopod' 'Echinoderm' 'Fish' 'Mollusc' 'Lichen' 'Mammal' 'Reptile' 'Ascidian' 'Nematode' 'Chilopod' 'Echiura']
Focusing only on bats¶
Chiroptera is the order containing the only mammals capable of true flight, the bats. The name (from the Greek for "hand-wing") refers to the hand-like wings of bats, which are formed from four elongated "fingers" covered by a cutaneous membrane.
Vespertilionidae is a family of microbats, of the order Chiroptera, flying, insect-eating mammals variously described as the common, vesper, or simple nosed bats.
merged_cleaned
| taxa | kingdom | phylum | class | order | family | genus | species | common_name | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Insect | ANIMALIA | ARTHROPODA | INSECTA | Unknown | Unknown | Unknown | Unknown | Unknown | -37.8185 | 144.9748 |
| 4 | Insect | ANIMALIA | ARTHROPODA | INSECTA | Unknown | Unknown | Unknown | Unknown | Unknown | -37.7908 | 144.9634 |
| 5 | Insect | ANIMALIA | ARTHROPODA | INSECTA | Unknown | Unknown | Unknown | Unknown | Unknown | -37.8216 | 144.9706 |
| 6 | Insect | ANIMALIA | ARTHROPODA | INSECTA | Unknown | Unknown | Unknown | Unknown | Unknown | -37.7986 | 144.9606 |
| 7 | Insect | ANIMALIA | ARTHROPODA | INSECTA | Unknown | Unknown | Unknown | Unknown | Unknown | -37.7918 | 144.9564 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1359 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | CHIROPTERA | MOLOSSIDAE | Mormopterus | None | None | -37.8135 | 144.9786 |
| 1360 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | CHIROPTERA | VESPERTILIONIDAE | Myotis | macropus | Large-footed Myotis | -37.8312 | 144.9804 |
| 1361 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | VESPERTILIONIDAE | MOLOSSIDAE | Scotorepens | None | None | -37.8312 | 144.9804 |
| 1362 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | VESPERTILIONIDAE | MOLOSSIDAE | Scotorepens | None | None | -37.8135 | 144.9786 |
| 1363 | Mammal | ANIMALIA | CHORDATA | MAMMALIA | CHIROPTERA | VEPSERTILIONIDAE | Nyctophilus | None | None | -37.8312 | 144.9804 |
1322 rows × 11 columns
# Check for NaN values in latitude and longitude
nan_latitude = merged_cleaned['latitude'].isna().sum()
nan_longitude = merged_cleaned['longitude'].isna().sum()
print(f"Number of NaN values in latitude: {nan_latitude}")
print(f"Number of NaN values in longitude: {nan_longitude}")
Number of NaN values in latitude: 0 Number of NaN values in longitude: 0
Mapping of bats based on their unique locations¶
# Filter for entries whose 'order' value is 'CHIROPTERA' or 'VESPERTILIONIDAE'
# (Vespertilionidae is strictly a family, but some records in this dataset carry it in the 'order' column)
bats_data = merged_cleaned[
    merged_cleaned['order'].str.upper().isin(['CHIROPTERA', 'VESPERTILIONIDAE'])
]
# Summary of the filtered data
print("Summary of Bat Observations:")
print(f"Total records: {bats_data.shape[0]}")
print(f"Unique species: {bats_data['species'].nunique()}")
common_species = bats_data['species'].mode().values
print(f"Common species: {common_species if common_species.size > 0 else 'None'}")
print(f"Unique locations: {bats_data[['latitude', 'longitude']].dropna().drop_duplicates().shape[0]}")
# Create a map centered around the approximate locations of the bat observations
bat_map = folium.Map(location=[-37.8311, 144.9452], zoom_start=12)
# Add markers for each bat observation
for idx, row in bats_data.dropna(subset=['latitude', 'longitude']).iterrows():
    species_info = f"{row['genus']} {row['species']} - {row['common_name']}"
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=species_info,
        icon=folium.Icon(color='red', icon='glyphicon-tint')
    ).add_to(bat_map)
# Display the map directly in Jupyter (optional)
bat_map
Summary of Bat Observations: Total records: 17 Unique species: 5 Common species: ['poliocephalus'] Unique locations: 7
Piechart of bat species distribution¶
The pie chart of bat species distribution visualizes the relative frequencies of different bat species in the dataset. Each slice of the pie chart represents a specific species, with the size of the slice corresponding to the proportion of observations of that species in comparison to the total observations across all species.
- Quantitative Comparison: This chart provides a quick and easy way to compare how common each bat species is within the studied area. Larger slices indicate more commonly observed species, while smaller slices represent less frequent ones.
- Diversity Insight: It helps in understanding the biodiversity of bats in the region by showing the variety of species and their relative abundance.
- Conservation Priorities: For conservation efforts, knowing which species are more or less common can help prioritize actions, especially if some of the less common species are also known to be at risk.
# Ensure the species column does not have too many unique categories
species_counts = bats_data['species'].value_counts()
top_species = species_counts.head(10) # You can adjust to include more species if needed
# Create a pie chart
plt.figure(figsize=(10, 7))
plt.pie(top_species, labels=top_species.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Bat Species')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
Species richness and observations per species¶
Species richness refers to the number of different species present in a given ecological community, region, or habitat. It is a measure of biodiversity that does not account for the abundance of species, only their presence.
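A minimal illustration of the distinction, on a made-up observation log rather than the project data (species names here are only examples), contrasting richness (presence) with abundance (counts):

```python
from collections import Counter

# Hypothetical observation log: five records, but only three distinct species
observations = [
    "Myotis macropus", "Myotis macropus", "Nyctophilus geoffroyi",
    "Chalinolobus gouldii", "Myotis macropus",
]

richness = len(set(observations))    # presence only: 3 distinct species
abundance = Counter(observations)    # how many records per species

print(richness)                      # 3
print(abundance["Myotis macropus"])  # 3
```

Two communities can share the same richness while having very different abundance profiles, which is why the bar plot below complements the single richness number.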
import matplotlib.pyplot as plt
import seaborn as sns
# Calculate species richness (number of unique species)
species_richness = bats_data['species'].nunique()
# Calculate the distribution of observations per species
species_distribution = bats_data['species'].value_counts()
# Plot the distribution of observations per species
plt.figure(figsize=(12, 6))
sns.barplot(x=species_distribution.index, y=species_distribution.values, palette="viridis")
plt.title('Distribution of Bat Observations by Species')
plt.xlabel('Species')
plt.ylabel('Number of Observations')
plt.xticks(rotation=45)
plt.show()
print(f"Species Richness: {species_richness}")
Species Richness: 5
Heatmap¶
The color intensity illustrates the density of bat observations across the map. By displaying data on a color gradient, heatmaps allow researchers and conservationists to quickly identify hotspots of bat activity, understand habitat preferences, and discern spatial patterns that may influence bat behavior.
from folium.plugins import HeatMap
# Create a map centered around the approximate locations of the bat observations
heat_map = folium.Map(location=[-37.8311, 144.9452], zoom_start=12)
# Add a heat map layer
heat_data = [[row['latitude'], row['longitude']] for index, row in bats_data.dropna(subset=['latitude', 'longitude']).iterrows()]
HeatMap(heat_data).add_to(heat_map)
# Display the map directly in Jupyter (optional)
heat_map
Clustering the Bat data¶
from folium.plugins import MarkerCluster
# Create a map centered around the average coordinates
bat_map = folium.Map(location=[bats_data['latitude'].mean(), bats_data['longitude'].mean()], zoom_start=12)
# Create a marker cluster
marker_cluster = MarkerCluster().add_to(bat_map)
# Add markers to the cluster instead of the map
for idx, row in bats_data.dropna(subset=['latitude', 'longitude']).iterrows():
    species_info = f"{row['genus']} {row['species']} - {row['common_name']}"
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=species_info,
        icon=folium.Icon(color='red', icon='glyphicon-tint')
    ).add_to(marker_cluster)
# display the map
bat_map
Butterfly Biodiversity¶
butterfly
| site | sloc | walk | sighting_date | time | vegwalktime | vegspecies | vegfamily | latitude | longitude | ... | tabe | brow | csem | aand | jvil | paur | ogyr | gmac | datetime | geopoint | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Womens Peace Gardens | 2 | 1 | 2017-02-26 | 11:42:00 | 1.3128 | Schinus molle | Anacardiaceae | -37.7912 | 144.9244 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-02-26 11:42:00+00:00 | -37.7912, 144.9244 |
| 1 | Argyle Square | 1 | 1 | 2017-11-02 | 10:30:00 | 0.3051 | Rosmarinus officinalis | Lamiaceae | -37.8023 | 144.9665 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-02-11 10:30:00+00:00 | -37.8023, 144.9665 |
| 2 | Argyle Square | 2 | 1 | 2017-12-01 | 10:35:00 | 0.3620 | Euphorbia sp. | Euphorbiaceae | -37.8026 | 144.9665 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-01-12 10:35:00+00:00 | -37.8026, 144.9665 |
| 3 | Westgate Park | 4 | 1 | 2017-03-03 | 11:44:00 | 3.1585 | Melaleuca lanceolata | Myrtaceae | -37.8316 | 144.9089 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-03-03 11:44:00+00:00 | -37.8316, 144.9089 |
| 4 | Argyle Square | 1 | 3 | 2017-01-15 | 12:33:00 | 0.4432 | Mentha sp. | Lamiaceae | -37.8027 | 144.9662 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-01-15 12:33:00+00:00 | -37.8027, 144.9662 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4051 | Fitzroy-Treasury Gardens | 3 | 2 | 2017-06-02 | 17:44:00 | 0.5132 | Tagetes sp. | Asteraceae | -37.8136 | 144.9819 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-02-06 17:44:00+00:00 | -37.8136, 144.9819 |
| 4052 | Westgate Park | 4 | 2 | 2017-02-02 | 13:57:00 | 2.1947 | Myoporum parvifolium | Scrophulariaceae | -37.8311 | 144.9092 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-02-02 13:57:00+00:00 | -37.8311, 144.9092 |
| 4053 | Westgate Park | 5 | 3 | 2017-06-03 | 15:43:00 | 4.2408 | Cassinia arcuata | Asteraceae | -37.8299 | 144.9106 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-03-06 15:43:00+00:00 | -37.8299, 144.9106 |
| 4054 | Westgate Park | 4 | 1 | 2017-02-02 | 11:05:00 | 1.5948 | Xerochrysum viscosum | Asteraceae | -37.8316 | 144.9093 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-02-02 11:05:00+00:00 | -37.8316, 144.9093 |
| 4055 | Carlton Gardens South | 3 | 1 | 2017-01-30 | 12:42:00 | 1.4437 | Asteraceae 1 | Asteraceae | -37.8044 | 144.9704 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017-01-30 12:42:00+00:00 | -37.8044, 144.9704 |
4056 rows × 42 columns
Number of unique family and species¶
# Calculate the number of unique species
unique_species = butterfly['vegspecies'].nunique()
# Calculate the number of unique families
unique_families = butterfly['vegfamily'].nunique()
print(f"Number of unique vegetation species: {unique_species}")
print(f"Number of unique vegetation families: {unique_families}")
Number of unique vegetation species: 134 Number of unique vegetation families: 59
Summary of the number of sightings at each site¶
# Count the number of sightings at each site
sightings_per_site = butterfly['site'].value_counts()
# Bar plot of the number of sightings per site
plt.figure(figsize=(12, 8))
sightings_per_site.plot(kind='bar')
plt.title('Number of Sightings per Site')
plt.xlabel('Site')
plt.ylabel('Sightings Count')
plt.xticks(rotation=45, ha='right') # Rotate the x labels for better readability
plt.tight_layout() # Adjust layout
plt.show()
Average temperature and humidity of each site¶
# We first group the data by 'site' and calculate the mean for the 'temp' and 'hum' columns
site_comparison = butterfly.groupby('site').agg({'temp':'mean', 'hum':'mean'}).reset_index()
# Now let's create a bar plot for average temperature by site
plt.figure(figsize=(12, 8))
sns.barplot(x='site', y='temp', data=site_comparison, palette='coolwarm')
plt.title('Average Temperature by Site')
plt.xlabel('Site')
plt.ylabel('Average Temperature (°C)')
plt.xticks(rotation=45, ha='right') # Rotate the x labels for better readability
plt.tight_layout() # Adjust layout
plt.show()
# And a bar plot for average humidity by site
plt.figure(figsize=(12, 8))
sns.barplot(x='site', y='hum', data=site_comparison, palette='coolwarm')
plt.title('Average Humidity by Site')
plt.xlabel('Site')
plt.ylabel('Average Humidity (%)')
plt.xticks(rotation=45, ha='right') # Rotate the x labels for better readability
plt.tight_layout() # Adjust layout
plt.show()
Top 10 vegetation families¶
# Count occurrences of each vegetation family
vegfamily_counts = butterfly['vegfamily'].value_counts()
# Number of unique families
unique_families = len(vegfamily_counts)
# Print the number of unique families
print(f'Number of unique vegetation families: {unique_families}')
# Print the top 10 families
print('Top 10 vegetation families by occurrence:')
print(vegfamily_counts.head(10))
Number of unique vegetation families: 59 Top 10 vegetation families by occurrence: vegfamily Asteraceae 644 Fabaceae 544 Lamiaceae 444 Myrtaceae 152 Plumbaginaceae 148 Anacardiaceae 140 Goodeniaceae 124 Campanulaceae 116 Brassicaceae 112 Pittosporaceae 112 Name: count, dtype: int64
Family histogram¶
# Count occurrences of each vegetation family
vegfamily_counts = butterfly['vegfamily'].value_counts()
# Bar plot
plt.figure(figsize=(12, 8))
vegfamily_counts.plot(kind='bar', color='cadetblue')
plt.title('Distribution of Vegetation Families')
plt.xlabel('Vegetation Family')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right') # Rotate labels to improve readability
plt.tight_layout() # Adjust layout to make room for label rotation
plt.show()
Unique Species count¶
# Count occurrences of each vegetation species
vegspecies_counts = butterfly['vegspecies'].value_counts()
# Number of unique species
unique_species = len(vegspecies_counts)
# Print the number of unique species
print(f'Number of unique vegetation species: {unique_species}')
# Print the top 10 species
print('Top 10 vegetation species by occurrence:')
print(vegspecies_counts.head(10))
Number of unique vegetation species: 134 Top 10 vegetation species by occurrence: vegspecies Trifolium repens 412 Asteraceae 1 244 Schinus molle 140 Goodenia ovata 124 Wahlenbergia sp. 116 Bursaria spinosa 112 Raphanus raphanistrum 112 Galenia pubescens 96 Salvia sp. 88 Canna generalis 88 Name: count, dtype: int64
Temperature distribution¶
# Temperature histogram
plt.figure(figsize=(8, 6))
plt.hist(butterfly['temp'], bins=20, color='skyblue', edgecolor='black')
plt.title('Temperature Distribution')
plt.xlabel('Temperature (°C)')
plt.ylabel('Frequency')
plt.show()
Humidity distribution¶
# Humidity histogram
plt.figure(figsize=(8, 6))
plt.hist(butterfly['hum'], bins=20, color='lightgreen', edgecolor='black')
plt.title('Humidity Distribution')
plt.xlabel('Humidity (%)')
plt.ylabel('Frequency')
plt.show()
Butterfly sightings¶
# Convert the sighting_date to datetime format
butterfly['sighting_date'] = pd.to_datetime(butterfly['sighting_date'])
# Sum sightings across the selected butterfly count columns
butterfly['total_sightings'] = butterfly[['blue', 'dpet', 'dple']].sum(axis=1)
# Group by date and sum sightings
sightings_by_date = butterfly.groupby('sighting_date')['total_sightings'].sum()
# Plotting
plt.figure(figsize=(12, 6))
plt.plot(sightings_by_date)
plt.title('Butterfly Sightings Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sightings')
plt.grid(True)
plt.show()
Temperature and humidity boxplot¶
# Creating box plots for Temperature grouped by Vegetation Family
plt.figure(figsize=(12, 8))
sns.boxplot(x='vegfamily', y='temp', data=butterfly)
plt.title('Temperature Distribution by Vegetation Family')
plt.xlabel('Vegetation Family')
plt.ylabel('Temperature (°C)')
plt.xticks(rotation=90) # Rotate the x labels for better readability
plt.tight_layout() # Adjust layout
plt.show()
# Creating box plots for Humidity grouped by Vegetation Family
plt.figure(figsize=(12, 8))
sns.boxplot(x='vegfamily', y='hum', data=butterfly)
plt.title('Humidity Distribution by Vegetation Family')
plt.xlabel('Vegetation Family')
plt.ylabel('Humidity (%)')
plt.xticks(rotation=90) # Rotate the x labels for better readability
plt.tight_layout() # Adjust layout
plt.show()
plt.figure(figsize=(14, 10)) # Adjusted figure size
sns.boxplot(y='vegfamily', x='temp', data=butterfly)
plt.title('Temperature Distribution by Vegetation Family')
plt.ylabel('Vegetation Family')
plt.xlabel('Temperature (°C)')
plt.tight_layout() # Adjust layout
plt.show()
Spread and Distribution: Each box plot represents the spread and central tendency of temperatures observed for each vegetation family. The bottom and top of each box are the first and third quartiles, and the band inside the box is the median. The whiskers extend to show the range of the data, and points outside of these are considered outliers.
Temperature Ranges: Temperature ranges vary across vegetation families. Some families have a wider range of temperatures where butterflies have been sighted, indicated by longer boxes and whiskers; others have a narrower range, shown by shorter boxes and whiskers.
Outliers: There are a few outliers present in several vegetation families. Outliers are the individual points that occur far away from the general cluster of data points, indicated by the diamonds outside of the whiskers. These could represent days with unusually high or low temperatures for sightings associated with those vegetation families.
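The outlier rule behind these boxplots is Tukey's fences: points beyond 1.5 × IQR from the quartiles are drawn individually. A sketch on a small made-up temperature sample (exact quartile interpolation can differ slightly from matplotlib's):

```python
import statistics

temps = [18.0, 19.5, 20.0, 21.0, 21.5, 22.0, 23.0, 35.0]  # 35.0 is suspiciously high

q1, _, q3 = statistics.quantiles(temps, n=4)  # quartiles ('exclusive' method by default)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers = [t for t in temps if t < lower or t > upper]
print(outliers)  # [35.0]
```

Under this rule only the 35.0 °C reading falls outside the fences, matching how a boxplot would render it as a lone point beyond the whisker.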
Butterfly scatterplot¶
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(butterfly['temp'], butterfly['hum'], alpha=0.5)
plt.title('Scatter Plot of Temperature vs. Humidity')
plt.xlabel('Temperature (°C)')
plt.ylabel('Humidity (%)')
plt.grid(True)
plt.show()
Mapping all data points¶
# Create a map centered around Melbourne
all_points_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=12)
# Add markers for each butterfly observation
for idx, row in butterfly.dropna(subset=['latitude', 'longitude']).iterrows():
    species_info = f"{row['vegspecies']} - {row['vegfamily']}"  # Update with relevant information
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=species_info
    ).add_to(all_points_map)
# Display the map
all_points_map
Creating clusters¶
# Create a map centered around Melbourne
cluster_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=12)
# Create a MarkerCluster object
marker_cluster = MarkerCluster().add_to(cluster_map)
# Add clustered markers for each butterfly observation
for idx, row in butterfly.dropna(subset=['latitude', 'longitude']).iterrows():
    species_info = f"{row['vegspecies']} - {row['vegfamily']}"  # Update with relevant information
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=species_info
    ).add_to(marker_cluster)
# Display the map
cluster_map
Heatmap¶
# Create a map centered around Melbourne
heatmap_map = folium.Map(location=[-37.8136, 144.9631], zoom_start=12)
# Prepare data for HeatMap
heatmap_data = butterfly[['latitude', 'longitude']].dropna().values.tolist()
# Add HeatMap layer
HeatMap(heatmap_data).add_to(heatmap_map)
# Display the map
heatmap_map
Temperature and humidity correlation matrix heatmap¶
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Selecting only numeric columns for correlation - adjust this list as necessary
numeric_columns = ['temp', 'hum'] # Add other numeric columns as needed
butterfly_numeric = butterfly[numeric_columns]
# Calculating the correlation matrix
correlation_matrix = butterfly_numeric.corr()
# Creating the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm',
xticklabels=correlation_matrix.columns,
yticklabels=correlation_matrix.columns)
# Showing the plot
plt.title('Correlation Heatmap of Butterfly Dataset Variables')
plt.show()
Temperature (temp) and Humidity (hum) Relationship: There is a negative correlation of -0.67 between temperature and humidity. This indicates a moderately strong inverse relationship, meaning that as temperature increases, humidity tends to decrease, and vice versa within the dataset's observations.
Strength of Correlation: The value of -0.67 is not close to -1, which means the relationship, while negative, is not perfectly linear and other factors may also influence the observed humidity and temperature values.
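For reference, the Pearson coefficient that `DataFrame.corr()` reports is the covariance divided by the product of the standard deviations. A hand computation on a tiny made-up sample (values are illustrative, not taken from the dataset):

```python
import statistics

x = [10.0, 15.0, 20.0, 25.0, 30.0]   # e.g. temperature readings
y = [80.0, 70.0, 65.0, 50.0, 45.0]   # e.g. humidity, falling as x rises

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
# Sample covariance (n - 1 denominator, matching pandas)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
r = cov / (statistics.stdev(x) * statistics.stdev(y))
print(round(r, 3))  # -0.988: a strong inverse relationship
```

A value near -0.67, as in the heatmap above, sits well short of this near-perfect case, which is why the scatter plot still shows substantial spread around the trend.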
Part 3: Predicting Presence of a Species¶
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Create a binary target variable indicating the presence of species in the Asteraceae family
butterfly['target'] = (butterfly['vegfamily'] == 'Asteraceae').astype(int)
# Select a subset of potentially relevant features (copy to avoid SettingWithCopyWarning)
features = butterfly[['site', 'sloc', 'walk', 'time', 'vegwalktime', 'latitude', 'longitude']].copy()
# Convert categorical features to numerical codes using LabelEncoder
label_encoders = {}
for column in features.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    features[column] = le.fit_transform(features[column])
    label_encoders[column] = le
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, butterfly['target'], test_size=0.2, random_state=42)
# Print the shapes of the training and testing data
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
Training set shape: (3244, 7) Testing set shape: (812, 7)
The dataset was split into training (3244 samples) and testing (812 samples) sets.
The logistic regression model¶
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Impute missing values
# For numerical features, use the mean
# For categorical features, use the most frequent value
numerical_imputer = SimpleImputer(strategy='mean')
categorical_imputer = SimpleImputer(strategy='most_frequent')
numerical_cols = X_train.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = X_train.select_dtypes(include=['object']).columns
X_train[numerical_cols] = numerical_imputer.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = numerical_imputer.transform(X_test[numerical_cols])
if len(categorical_cols) > 0:
    X_train[categorical_cols] = categorical_imputer.fit_transform(X_train[categorical_cols])
    X_test[categorical_cols] = categorical_imputer.transform(X_test[categorical_cols])
# Initialize and train the logistic regression model
logreg_model = LogisticRegression(max_iter=1000)
logreg_model.fit(X_train, y_train)
# Predict on the testing set and evaluate the model
y_pred = logreg_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", class_report)
Accuracy: 0.8165024630541872
Classification Report:
precision recall f1-score support
0 0.82 1.00 0.90 663
1 0.00 0.00 0.00 149
accuracy 0.82 812
macro avg 0.41 0.50 0.45 812
weighted avg 0.67 0.82 0.73 812
Accuracy: 81.65%
Class 0 (Negative)
- Precision: 82% - Of all the predictions for class 0, 82% were correct.
- Recall: 100% - The model successfully identified all actual instances of class 0, which reflects high sensitivity for this class.
- F1-Score: 90% - A high F1-score indicates excellent model performance for the negative class.
Class 1 (Positive)
- Precision: 0% - This indicates that there were no correct predictions for class 1; thus, precision is not applicable in this context due to no predicted positive instances.
- Recall: 0% - The model failed to correctly identify any actual instances of class 1, indicating a complete lack of sensitivity for this class.
- F1-Score: 0% - This extremely low F1-score underscores poor performance for the positive class, with significant room for improvement.
Classification Report: The report indicates a high accuracy, but a deeper look reveals some issues. Specifically, the model predicts all instances as the majority class (class 0, or 'no presence of Asteraceae species'). This is evident from the precision, recall, and F1-score for class 1 being 0, which indicates that the model fails to identify any positive cases of the Asteraceae presence correctly. This is a common issue in imbalanced datasets where one class significantly outnumbers the other. The model tends to favor the majority class at the expense of the minority class.
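The headline accuracy can be reproduced directly from the class support above: with 663 negatives and 149 positives in the test set, a trivial classifier that always predicts the majority class scores exactly the accuracy the model reported.

```python
# Test-set class support from the classification report above
negatives, positives = 663, 149
total = negatives + positives

# A classifier that always predicts class 0 is right on every negative example
majority_accuracy = negatives / total
print(round(majority_accuracy, 4))  # 0.8165 — identical to the logistic regression model
```

This is why accuracy alone is a misleading metric on imbalanced data: the model earned its 81.65% without identifying a single Asteraceae-positive case.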
Confusion Matrix¶
Visualizing the true positives, true negatives, false positives, and false negatives.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
# Initialize and train the logistic regression model with class weight 'balanced'
logreg_balanced = LogisticRegression(max_iter=1000, class_weight='balanced')
logreg_balanced.fit(X_train, y_train)
# Predict on the testing set
y_pred_balanced = logreg_balanced.predict(X_test)
y_pred_proba_balanced = logreg_balanced.predict_proba(X_test)[:, 1] # probabilities for the positive class
# Compute confusion matrix
conf_matrix_balanced = confusion_matrix(y_test, y_pred_balanced)
# Compute ROC curve and AUC
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_balanced)
roc_auc = auc(fpr, tpr)
# Plotting the confusion matrix
plt.figure(figsize=(6, 5))
plt.imshow(conf_matrix_balanced, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix - Balanced Classes')
plt.colorbar()
tick_marks = np.arange(2)
plt.xticks(tick_marks, ['Negative', 'Positive'], rotation=45)
plt.yticks(tick_marks, ['Negative', 'Positive'])
for i in range(conf_matrix_balanced.shape[0]):
    for j in range(conf_matrix_balanced.shape[1]):
        plt.text(j, i, conf_matrix_balanced[i, j],
                 horizontalalignment="center",
                 color="white" if conf_matrix_balanced[i, j] > conf_matrix_balanced.max() / 2 else "black")
plt.tight_layout()
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
Confusion Matrix:
- True Negatives (Top-Left): 309 - The model correctly predicted 'no presence' of Asteraceae species.
- False Positives (Top-Right): 354 - The model incorrectly predicted 'presence' when there was none.
- False Negatives (Bottom-Left): 62 - The model failed to predict 'presence' when the species was present.
- True Positives (Bottom-Right): 87 - The model correctly predicted 'presence' of Asteraceae species.
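From these four counts, the balanced model's headline metrics follow directly (a quick check, using only the numbers above):

```python
tn, fp, fn, tp = 309, 354, 62, 87  # counts from the confusion matrix above

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted presences, how many were real
recall = tp / (tp + fn)      # of real presences, how many were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.488 0.197 0.584
```

Balancing the class weights thus traded overall accuracy (down from 0.82 to 0.49) for substantially better recall on the minority class (up from 0.00 to 0.58).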
ROC Curve and AUC Score¶
Evaluate the performance across different thresholds, showing the trade-off between sensitivity and specificity.
# Plotting the ROC curve
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()
The AUC (Area Under the Curve) score is 0.56, only slightly better than random guessing, which indicates the model struggles to distinguish between the classes. The ROC curve lies close to the diagonal (the line of no discrimination), reflecting this weak performance.
0.5: Represents a model that makes predictions no better than random guessing.
1.0: Represents a perfect model that classifies all positive and negative examples correctly.
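AUC also has a useful probabilistic reading: it is the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A brute-force computation on a made-up score list (labels and probabilities here are illustrative only):

```python
# Each position pairs a true label with a predicted probability of the positive class
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.50, 0.35, 0.70, 0.80]

pos = [s for y, s in zip(y_true, scores) if y == 1]
neg = [s for y, s in zip(y_true, scores) if y == 0]

# Count positive/negative pairs where the positive is ranked higher (ties count half)
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 7/9 ≈ 0.778, the same value roc_auc_score would report
```

An AUC of 0.56 therefore means the model ranks a true Asteraceae site above a non-Asteraceae site only about 56% of the time.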
Implementing oversampling and undersampling¶
Oversampling the Minority Class
from sklearn.metrics import confusion_matrix, roc_auc_score
from imblearn.over_sampling import RandomOverSampler
# Initialize the RandomOverSampler object
ros = RandomOverSampler(random_state=42)
# Resample the dataset
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
# Initialize and train the logistic regression model on the oversampled data
logreg_ros = LogisticRegression(max_iter=1000)
logreg_ros.fit(X_train_ros, y_train_ros)
# Predict on the testing set
y_pred_ros = logreg_ros.predict(X_test)
y_pred_proba_ros = logreg_ros.predict_proba(X_test)[:, 1]
# Evaluate the model
print("Confusion Matrix for Oversampled Data:")
print(confusion_matrix(y_test, y_pred_ros))
print("ROC AUC for Oversampled Data:", roc_auc_score(y_test, y_pred_proba_ros))
Confusion Matrix for Oversampled Data: [[312 351] [ 64 85]] ROC AUC for Oversampled Data: 0.556540840393979
Undersampling the Majority Class
from imblearn.under_sampling import RandomUnderSampler
# Initialize the RandomUnderSampler object
rus = RandomUnderSampler(random_state=42)
# Resample the dataset
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
# Initialize and train the logistic regression model on the undersampled data
logreg_rus = LogisticRegression(max_iter=1000)
logreg_rus.fit(X_train_rus, y_train_rus)
# Predict on the testing set
y_pred_rus = logreg_rus.predict(X_test)
y_pred_proba_rus = logreg_rus.predict_proba(X_test)[:, 1]
# Evaluate the model
print("Confusion Matrix for Undersampled Data:")
print(confusion_matrix(y_test, y_pred_rus))
print("ROC AUC for Undersampled Data:", roc_auc_score(y_test, y_pred_proba_rus))
Confusion Matrix for Undersampled Data: [[346 317] [ 71 78]] ROC AUC for Undersampled Data: 0.5685363458754695
ROC curves¶
# Plotting ROC curves for all models (AUC values computed, not hardcoded)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='Original Balanced (AUC = %0.3f)' % roc_auc)
fpr_ros, tpr_ros, _ = roc_curve(y_test, y_pred_proba_ros)
plt.plot(fpr_ros, tpr_ros, label='Oversampled (AUC = %0.3f)' % roc_auc_score(y_test, y_pred_proba_ros))
fpr_rus, tpr_rus, _ = roc_curve(y_test, y_pred_proba_rus)
plt.plot(fpr_rus, tpr_rus, label='Undersampled (AUC = %0.3f)' % roc_auc_score(y_test, y_pred_proba_rus))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend(loc='lower right')
plt.show()
Random Forest¶
from sklearn.ensemble import RandomForestClassifier
# Initialize the Random Forest model
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
random_forest.fit(X_train, y_train)
# Predict on the testing set
y_pred_rf = random_forest.predict(X_test)
y_pred_proba_rf = random_forest.predict_proba(X_test)[:, 1]
# Calculate metrics
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
roc_auc_rf = roc_auc_score(y_test, y_pred_proba_rf)
print("Random Forest Confusion Matrix:\n", conf_matrix_rf)
print("Random Forest ROC AUC:", roc_auc_rf)
Random Forest Confusion Matrix:
 [[663   0]
 [  0 149]]
Random Forest ROC AUC: 1.0000000000000002
The Random Forest model achieved a perfect ROC AUC of 1.0, with a confusion matrix showing no false positives or false negatives. A perfect score on held-out data is rarely a sign of genuinely good performance; it more often indicates data leakage or memorisation, for example an identifier-like feature such as site mapping directly onto the label. This result warrants closer inspection before the model is trusted.
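One quick leakage check is to ask whether each site value maps to a single class label: if so, a tree model can memorise site → label and score perfectly without learning anything generalisable. A hypothetical sketch (the site values and present column here are illustrative, not the project's data):

```python
import pandas as pd

# Toy frame: every site carries exactly one label value
df = pd.DataFrame({
    "site": ["A", "A", "B", "B", "C", "C"],
    "present": [1, 1, 0, 0, 1, 1],
})

# Fraction of sites whose rows all share one label ("pure" sites)
labels_per_site = df.groupby("site")["present"].nunique()
pure_fraction = float((labels_per_site == 1).mean())
print(f"Fraction of single-label sites: {pure_fraction:.2f}")  # 1.00 here
```

A pure fraction near 1.0 on the real data would suggest dropping site (or splitting train/test by site) before re-evaluating.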
Gradient Boosting¶
from sklearn.ensemble import GradientBoostingClassifier
# Initialize the Gradient Boosting model
gradient_boosting = GradientBoostingClassifier(n_estimators=100, random_state=42)
# Train the model
gradient_boosting.fit(X_train, y_train)
# Predict on the testing set
y_pred_gb = gradient_boosting.predict(X_test)
y_pred_proba_gb = gradient_boosting.predict_proba(X_test)[:, 1]
# Calculate metrics
conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)
roc_auc_gb = roc_auc_score(y_test, y_pred_proba_gb)
print("Gradient Boosting Confusion Matrix:\n", conf_matrix_gb)
print("Gradient Boosting ROC AUC:", roc_auc_gb)
Gradient Boosting Confusion Matrix:
 [[663   0]
 [112  37]]
Gradient Boosting ROC AUC: 0.8727464140018424
Plotting ROC curves for both models¶
plt.figure(figsize=(8, 6))
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_proba_rf)
fpr_gb, tpr_gb, _ = roc_curve(y_test, y_pred_proba_gb)
plt.plot(fpr_rf, tpr_rf, label='Random Forest (AUC = 1.00)')
plt.plot(fpr_gb, tpr_gb, label='Gradient Boosting (AUC = 0.873)')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend(loc='lower right')
plt.show()
Feature importance¶
# Get feature importances from both models
importances_rf = random_forest.feature_importances_
importances_gb = gradient_boosting.feature_importances_
# Summarize feature importances in a DataFrame
feature_names = X_train.columns
importances_df = pd.DataFrame({
'Feature': feature_names,
'Importance_RF': importances_rf,
'Importance_GB': importances_gb
}).sort_values(by='Importance_GB', ascending=False)
# Plotting feature importances
fig, ax = plt.subplots(2, 1, figsize=(12, 12))
importances_df.plot(kind='barh', x='Feature', y='Importance_RF', ax=ax[0], color='blue', title='Random Forest Feature Importance')
importances_df.plot(kind='barh', x='Feature', y='Importance_GB', ax=ax[1], color='green', title='Gradient Boosting Feature Importance')
plt.tight_layout()
plt.show()
Random Forest: The site feature is the most important, followed by latitude and sloc, indicating that location-related features drive the Random Forest model's predictions. time is the least important, suggesting that observation timing contributes little to the model.
Gradient Boosting: site again leads in importance, reinforcing the role of location-related features in predicting the species' presence. walk and sloc also show considerable influence, suggesting that the conditions these features capture meaningfully affect the model.
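Impurity-based feature_importances_ are known to be biased toward high-cardinality features such as site, so the rankings above are worth cross-checking with permutation importance, which is measured on held-out data and penalises memorised identifiers that fail to generalise. A sketch on synthetic data standing in for the project's features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's feature matrix
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the test split and measure the drop in score;
# features that only memorise training rows show little or no drop
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: {imp:.3f}")
```

On the real data, a large gap between a feature's impurity importance and its permutation importance would strengthen the leakage suspicion around site.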
import shap
# Create a SHAP explainer object for Gradient Boosting model
explainer = shap.TreeExplainer(gradient_boosting)
shap_values = explainer.shap_values(X_test)
# Summarize mean absolute SHAP values per feature across the test set
shap.summary_plot(shap_values, X_test, plot_type="bar")
The significant features across both models are primarily related to location (site, latitude, longitude), which might be due to ecological factors specific to certain locations influencing the presence of the species.
Cross validation¶
from sklearn.model_selection import cross_val_score
# Set up k-fold cross-validation
k = 5 # Number of folds
# Random Forest cross-validation for accuracy
rf_cv_accuracy = cross_val_score(random_forest, X_train, y_train, cv=k, scoring='accuracy')
# Gradient Boosting cross-validation for accuracy
gb_cv_accuracy = cross_val_score(gradient_boosting, X_train, y_train, cv=k, scoring='accuracy')
print("Random Forest Average CV Accuracy:", np.mean(rf_cv_accuracy))
print("Gradient Boosting Average CV Accuracy:", np.mean(gb_cv_accuracy))
Random Forest Average CV Accuracy: 0.9916733245829292
Gradient Boosting Average CV Accuracy: 0.8800852213281593
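Accuracy is a flattering metric on a label this imbalanced: predicting the majority class everywhere already scores roughly 82% (663 of 812 test rows). Scoring the same cross-validation with ROC AUC on stratified folds gives a fairer picture. A sketch on synthetic imbalanced data rather than the project's X_train/y_train:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in: ~80% majority class
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=42)

# Stratified folds preserve the class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=50, random_state=42)

# ROC AUC is threshold-free and not inflated by majority-class guessing
auc_scores = cross_val_score(rf, X, y, cv=cv, scoring='roc_auc')
print("Per-fold ROC AUC:", np.round(auc_scores, 3))
print("Mean CV ROC AUC:", round(float(np.mean(auc_scores)), 3))
```

The same scoring='roc_auc' argument would drop straight into the cross_val_score calls above.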
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Parameter grid for Random Forest
param_grid_rf = {
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Setup the grid search
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_rf.fit(X_train, y_train)
print("Best parameters for Random Forest:", grid_search_rf.best_params_)
print("Best cross-validated accuracy for Random Forest:", grid_search_rf.best_score_)
Best parameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best cross-validated accuracy for Random Forest: 0.9916733245829292
from sklearn.ensemble import GradientBoostingClassifier
# Parameter grid for Gradient Boosting
param_grid_gb = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 4, 5]
}
# Setup the grid search
grid_search_gb = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid_gb, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_gb.fit(X_train, y_train)
print("Best parameters for Gradient Boosting:", grid_search_gb.best_params_)
print("Best cross-validated accuracy for Gradient Boosting:", grid_search_gb.best_score_)
Best parameters for Gradient Boosting: {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 200}
Best cross-validated accuracy for Gradient Boosting: 0.9919819665582379
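Grid-search scores are selected on the training folds, so the tuned model should still be confirmed on the untouched test split via best_estimator_, which GridSearchCV refits on the full training data by default. A sketch with a deliberately small grid on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the project's train/test split
X, y = make_classification(n_samples=200, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    {"max_depth": [3, None]}, cv=3, scoring="accuracy")
grid.fit(X_tr, y_tr)

# best_estimator_ is already refitted on the whole training split (refit=True)
test_auc = roc_auc_score(y_te, grid.best_estimator_.predict_proba(X_te)[:, 1])
print("Best params:", grid.best_params_)
print("Held-out ROC AUC:", round(float(test_auc), 3))
```

A large gap between best_score_ and the held-out score would indicate that the grid search overfitted the cross-validation folds.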